Although we visualize data as a two-dimensional grid for mathematical convenience, the hardware sees only a contiguous 1-D stream of bytes. Understanding this "linear reality" is the prerequisite for implementing row-wise reduction patterns, such as finding a maximum or summing exponentials.
1. The "Linear Flattening" Principle
Every multi-dimensional tensor is physically stored in sequence. To implement $\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$, we must identify the linear segment that represents a row, then traverse it to compute the max and the sum.
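As a minimal sketch of this idea (NumPy stands in for the actual kernel; the 4x8 shape and the names `flat`, `stride` are illustrative assumptions, not from any library):

```python
import numpy as np

# An illustrative 4x8 matrix; like every tensor, it lives in one flat 1-D buffer.
rows, cols = 4, 8
flat = np.arange(rows * cols, dtype=np.float32)  # contiguous storage
stride = cols  # elements to skip to reach the next row

# Locate the linear segment for row `r`, then traverse it for max and sum.
r = 2
start = r * stride
row_segment = flat[start : start + cols]
row_max = float(row_segment.max())  # first reduction over the segment
row_sum = float(row_segment.sum())  # second reduction over the segment
```

The `stride` here equals `cols` only because the buffer is contiguous; padded or transposed layouts would change the jump size without changing the principle.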
2. Numerical Stability
Why does Softmax need stabilization? Large input values cause $e^{x}$ to overflow. We stabilize by computing $$\exp(x_i - \max(x))$$ which forces the kernel designer to perform two linear reductions (first the max, then the sum) before the final normalization.
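A minimal NumPy sketch of the two-pass scheme (the helper name `stable_softmax` is ours, not a library function):

```python
import numpy as np

def stable_softmax(row: np.ndarray) -> np.ndarray:
    """Two linear reductions: the max first, then the sum of shifted exponentials."""
    row_max = row.max()              # pass 1: max reduction
    shifted = np.exp(row - row_max)  # largest exponent is now e^0 = 1
    return shifted / shifted.sum()   # pass 2: sum reduction, then normalize

probs = stable_softmax(np.array([1.0, 2.0, 3.0], dtype=np.float32))
```

Because every element is shifted by the same constant, the ratios between exponentials are unchanged, so the result is mathematically identical to the naive formula.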
3. Verification via Short Rows
When developing a Triton kernel, we first test with short rows only (e.g., width 16) to verify that our linear index arithmetic captures every element correctly, before scaling up to production workloads.
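A short-row check of this kind might compare a flat-indexing implementation against a trusted reference (a sketch with assumed names; a real Triton kernel would be validated the same way against a NumPy or PyTorch reference):

```python
import numpy as np

def softmax_flat(flat: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Row-wise softmax over a flat 1-D buffer using explicit linear indexing."""
    out = np.empty_like(flat)
    for r in range(rows):
        seg = flat[r * cols : (r + 1) * cols]  # linear segment of row r
        shifted = np.exp(seg - seg.max())
        out[r * cols : (r + 1) * cols] = shifted / shifted.sum()
    return out

# Short rows (width 16) expose indexing bugs without tiling complexity.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16)).astype(np.float32)
got = softmax_flat(x.ravel(), 4, 16).reshape(4, 16)
ref = np.exp(x - x.max(axis=1, keepdims=True))
ref = ref / ref.sum(axis=1, keepdims=True)
ok = np.allclose(got, ref, atol=1e-6)
```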
QUESTION 1
How are 2D tensors physically arranged in GPU memory?
As nested hardware folders.
As a contiguous 1D stream of bytes.
In a hexagonal lattice.
As independent scalar registers.
✅ Correct!
Regardless of dimensionality, memory is fundamentally linear; strides define the 'jumps' between rows.
❌ Incorrect
Modern RAM and Global Memory are addressed linearly.
QUESTION 2
What is the primary reason for performing a row-wise max reduction before exponentiation?
To sort the data for faster access.
To ensure numerical stability and prevent overflow.
To reduce the memory footprint of the tensor.
To align the data with 32-byte boundaries.
✅ Correct!
Subtracting the maximum ensures the largest exponent is $e^0 = 1$, preventing float16/float32 overflow.
❌ Incorrect
This is known as the 'Max Trick' for numerical stability in Softmax.
QUESTION 3
In the context of the Linear Reality, what is a reduction pattern?
The process of deleting unused rows.
Compressing the tensor using ZIP algorithms.
Aggregating multiple values into a single statistic (e.g., sum, max).
Reducing the clock speed of the GPU.
✅ Correct!
Reductions iterate over a dimension to produce a scalar per row/column.
❌ Incorrect
Reductions are about data aggregation, not compression or deletion.
QUESTION 4
Why is testing performed on 'short rows' first?
Short rows consume more power.
To verify indexing logic without complex tiling overhead.
Short rows are stored in L1 cache only.
Triton cannot handle rows longer than 1024.
✅ Correct!
Verifying small cases ensures the mathematical mapping between 2D indices and 1D memory is correct.
❌ Incorrect
It is a debugging strategy, not a hardware limitation.
QUESTION 5
Which formula represents the stable version of Softmax?
$$e^{x_i} / \sum e^{x_j}$$
$$\text{max}(x) / \text{sum}(x)$$
$$\frac{e^{x_i - \max(x)}}{\sum e^{x_j - \max(x)}}$$
$$x_i - \text{avg}(x)$$
✅ Correct!
Subtracting the max from every element before exponentiation maintains the same mathematical ratio while staying within numerical bounds.
❌ Incorrect
The naive formula (Option A) is mathematically correct but numerically unstable.
Case Study: Numerical Overflow in Softmax
Stabilizing FP16 Kernels
A developer implements a Softmax kernel for a model using FP16 precision. The inputs for one row are [10.0, 20.0, 80.0]. The kernel produces 'NaN' outputs.
Q
Why did the 'NaN' occur, and what is the maximum value $e^x$ can reach in FP16 before overflowing?
Solution:
In FP16, the maximum representable value is 65,504. $e^{80}$ is significantly larger than this, resulting in 'inf'. Dividing by 'inf' or encountering it in sums leads to 'NaN' results. The 'Max Trick' is required.
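The failure can be reproduced in a few lines, using NumPy's float16 as a stand-in for the kernel's FP16 arithmetic:

```python
import numpy as np

x = np.array([10.0, 20.0, 80.0], dtype=np.float16)
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(x)            # e^20 and e^80 exceed FP16's max of 65504 -> inf
    probs = naive / naive.sum()  # inf / inf propagates to NaN
```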
Q
Describe the implementation of row-wise max and sum for this scenario.
Solution:
First, perform a pass over the linear memory segment for the row to identify '80.0' as the max. Subtract 80.0 from all elements ([ -70, -60, 0 ]). Then compute $e^{-70}, e^{-60}, e^0$. The sum is now stable (approx 1.0), and the final division proceeds without overflow.
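The same row, processed with the Max Trick (again using NumPy float16 as a stand-in for the kernel's arithmetic):

```python
import numpy as np

x = np.array([10.0, 20.0, 80.0], dtype=np.float16)
m = x.max()                      # pass 1 over the row's linear segment: 80.0
with np.errstate(under="ignore"):
    shifted = np.exp(x - m)      # [e^-70, e^-60, e^0] -- nothing overflows
probs = shifted / shifted.sum()  # sum is approx 1.0; division stays finite
```

Note that $e^{-70}$ and $e^{-60}$ underflow to 0 in FP16, which is harmless here: those probabilities are genuinely negligible, and no inf or NaN is produced.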